Create a new pull request by comparing changes across two branches #1657

Merged
merged 11 commits
Jun 13, 2024

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

yaooqinn and others added 11 commits June 12, 2024 20:23
### What changes were proposed in this pull request?

This pull request optimizes the `Hex.hex(num: Long)` method by removing leading zeros, thus eliminating the need to copy the array to remove them afterward.
### Why are the changes needed?

Performance: the change avoids the trailing array copy.

- Unit tests added
- Did a benchmark locally (30~50% speedup)

```
Hex Long Tests:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Legacy                                             1062           1094          16          9.4         106.2       1.0X
New                                                 739            807          26         13.5          73.9       1.4X
```

```scala
object HexBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val N = 10_000_000
    runBenchmark("Hex") {
      val benchmark = new Benchmark("Hex Long Tests", N, 10, output = output)
      val range = 1 to 12
      benchmark.addCase("Legacy") { _ =>
        (1 to N).foreach(x => range.foreach(y => hexLegacy(x - y)))
      }

      benchmark.addCase("New") { _ =>
        (1 to N).foreach(x => range.foreach(y => Hex.hex(x - y)))
      }
      benchmark.run()
    }
  }

  def hexLegacy(num: Long): UTF8String = {
    // Extract the hex digits of num into value[] from right to left
    val value = new Array[Byte](16)
    var numBuf = num
    var len = 0
    do {
      len += 1
      // Hex.hexDigits must be accessible from this scope
      value(value.length - len) = Hex.hexDigits((numBuf & 0xF).toInt)
      numBuf >>>= 4
    } while (numBuf != 0)
    UTF8String.fromBytes(java.util.Arrays.copyOfRange(value, value.length - len, value.length))
  }
}
```
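The idea behind the change can be sketched outside Spark. This Python sketch (a hypothetical re-implementation for illustration, not the actual Scala code) contrasts the legacy approach, which fills a fixed 16-byte scratch buffer and then copies the used suffix, with sizing the buffer exactly up front and skipping the copy:

```python
HEX_DIGITS = b"0123456789ABCDEF"

def hex_legacy(num: int) -> bytes:
    """Fill a fixed 16-byte buffer right-to-left, then copy the used suffix."""
    num &= 0xFFFFFFFFFFFFFFFF  # treat as an unsigned 64-bit value, like >>> in Scala
    value = bytearray(16)
    length = 0
    while True:
        length += 1
        value[16 - length] = HEX_DIGITS[num & 0xF]
        num >>= 4
        if num == 0:
            break
    return bytes(value[16 - length:])  # the extra copy the PR eliminates

def hex_new(num: int) -> bytes:
    """Compute the digit count first, then fill an exactly-sized buffer."""
    num &= 0xFFFFFFFFFFFFFFFF
    length = max(1, (num.bit_length() + 3) // 4)  # hex digits needed, min 1
    value = bytearray(length)
    for i in range(length - 1, -1, -1):
        value[i] = HEX_DIGITS[num & 0xF]
        num >>= 4
    return bytes(value)
```

Both produce identical output; the new variant simply never allocates bytes it will throw away.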

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46952 from yaooqinn/SPARK-48596.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…olumnAlias`

### What changes were proposed in this pull request?
Rename `parent` field to `child` in `ColumnAlias`

### Why are the changes needed?
It should be `child` rather than `parent`, to be consistent with the other expressions in `expressions.py` and with the Scala side.

### Does this PR introduce _any_ user-facing change?
No, it is just an internal change

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46949 from zhengruifeng/minor_column_alias.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…perations

### What changes were proposed in this pull request?
Propagate cached schema in dataframe operations:

- DataFrame.alias
- DataFrame.coalesce
- DataFrame.repartition
- DataFrame.repartitionByRange
- DataFrame.dropDuplicates
- DataFrame.distinct
- DataFrame.filter
- DataFrame.where
- DataFrame.limit
- DataFrame.sort
- DataFrame.sortWithinPartitions
- DataFrame.orderBy
- DataFrame.sample
- DataFrame.hint
- DataFrame.randomSplit
- DataFrame.observe

### Why are the changes needed?
to avoid unnecessary RPCs if possible
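The mechanism can be illustrated with a toy model (hypothetical classes, not the actual `pyspark.sql.connect` implementation): a schema-preserving operation hands its already-fetched schema to the result, so asking the result for its schema needs no extra round trip.

```python
class RemoteDataFrame:
    """Toy stand-in for a Connect DataFrame whose schema lives server-side."""

    def __init__(self, plan, cached_schema=None):
        self.plan = plan
        self._cached_schema = cached_schema
        self.rpc_calls = 0  # how many times this frame had to ask the server

    @property
    def schema(self):
        if self._cached_schema is None:
            self.rpc_calls += 1  # stand-in for a real RPC round trip
            self._cached_schema = ("id", "value")
        return self._cached_schema

    def limit(self, n):
        # limit (like filter, sort, distinct, ...) never changes the schema,
        # so the cached schema is propagated to the new frame
        return RemoteDataFrame(("limit", n, self.plan), self._cached_schema)

df = RemoteDataFrame("scan")
_ = df.schema      # first access: one simulated RPC
child = df.limit(5)
_ = child.schema   # served from the propagated cache: no RPC at all
```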

### Does this PR introduce _any_ user-facing change?
No, optimization only

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46954 from zhengruifeng/py_connect_propagate_schema.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Simplify the if-else branches with `F.lit`, which accepts both Column and non-Column input.

### Why are the changes needed?
code clean up
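The cleanup pattern can be shown with a minimal stand-in for `F.lit` (toy classes for illustration; PySpark's real `lit` likewise returns a Column unchanged and wraps plain values):

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def lit(value):
    # a Column passes through untouched; anything else is wrapped as a literal
    return value if isinstance(value, Column) else Column(("lit", value))

# before: every call site branches on the input type
def to_column_old(value):
    if isinstance(value, Column):
        return value
    else:
        return Column(("lit", value))

# after: lit already accepts both, so the branch disappears
def to_column_new(value):
    return lit(value)
```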

### Does this PR introduce _any_ user-facing change?
No, internal minor refactor

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46946 from zhengruifeng/column_simplify.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add docs for SPJ

### Why are the changes needed?
There are no docs describing SPJ, even though it is mentioned in the migration notes: #46673

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Checked the new text

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46745 from szehon-ho/doc_spj.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…a function

### What changes were proposed in this pull request?
Fix the string representation of lambda function

### Why are the changes needed?
I happened to hit this bug.

### Does this PR introduce _any_ user-facing change?
yes

before
```
In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x)))
Out[2]: ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
    704 stream = StringIO()
    705 printer = pretty.RepresentationPrinter(stream, self.verbose,
    706     self.max_width, self.newline,
    707     max_seq_length=self.max_seq_length,
    708     singleton_pprinters=self.singleton_printers,
    709     type_pprinters=self.type_printers,
    710     deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
    712 printer.flush()
    713 return stream.getvalue()

File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
    408                         return meth(obj, self, cycle)
    409                 if cls is not object \
    410                         and callable(cls.__dict__.get('__repr__')):
--> 411                     return _repr_pprint(obj, self, cycle)
    413     return _default_pprint(obj, self, cycle)
    414 finally:

File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
    777 """A pprint that just redirects to the normal repr function."""
    778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
    780 lines = output.splitlines()
    781 with p.group():

File ~/Dev/spark/python/pyspark/sql/connect/column.py:441, in Column.__repr__(self)
    440 def __repr__(self) -> str:
--> 441     return "Column<'%s'>" % self._expr.__repr__()

File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:626, in UnresolvedFunction.__repr__(self)
    624     return f"{self._name}(distinct {', '.join([str(arg) for arg in self._args])})"
    625 else:
--> 626     return f"{self._name}({', '.join([str(arg) for arg in self._args])})"

File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:962, in LambdaFunction.__repr__(self)
    961 def __repr__(self) -> str:
--> 962     return f"(LambdaFunction({str(self._function)}, {', '.join(self._arguments)})"

TypeError: sequence item 0: expected str instance, UnresolvedNamedLambdaVariable found
```

after
```
In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x)))
Out[2]: Column<'array_sort(data, LambdaFunction(CASE WHEN or(isNull(x_0), isNull(y_1)) THEN 0 ELSE -(length(y_1), length(x_0)) END, x_0, y_1))'>
```
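The root cause is a plain Python pitfall: `str.join` requires every item to already be a string. A minimal reproduction with a hypothetical variable class (mirroring `UnresolvedNamedLambdaVariable`) shows the bug and the fix of stringifying each argument first:

```python
class LambdaVar:
    """Stand-in for UnresolvedNamedLambdaVariable."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

args = [LambdaVar("x_0"), LambdaVar("y_1")]

# buggy: join refuses non-string items, as in the traceback above
try:
    ", ".join(args)
    raised = False
except TypeError:
    raised = True

# fixed: convert each argument to str before joining
fixed = ", ".join(str(a) for a in args)
```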

### How was this patch tested?
CI, added test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46948 from zhengruifeng/fix_string_rep_lambda.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…commons-io` called in Spark

### What changes were proposed in this pull request?

This PR replaces deprecated classes and methods of `commons-io` called in Spark:

- `writeStringToFile(final File file, final String data)` -> `writeStringToFile(final File file, final String data, final Charset charset)`
- `CountingInputStream` -> `BoundedInputStream`

### Why are the changes needed?

Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed related test cases in `UDFXPathUtilSuite` and `XmlSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46935 from wayneguow/deprecated.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…with spark.sql.binaryOutputStyle

### What changes were proposed in this pull request?

In SPARK-47911, we introduced a universal BinaryFormatter to make binary output consistent
across all clients, such as beeline, spark-sql, and spark-shell, for both primitive and nested binaries.

But unfortunately, `to_csv` and the `csv` writer have interceptors for binary output that are hard-coded to use `SparkStringUtils.getHexString`. This PR makes them configurable as well.

### Why are the changes needed?

feature parity

### Does this PR introduce _any_ user-facing change?

Yes, we have made `spark.sql.binaryOutputStyle` work for CSV, while the existing default behavior is preserved.
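The idea can be illustrated with a hedged sketch (hypothetical function and style names; the actual `spark.sql.binaryOutputStyle` values and plumbing live in Spark's Scala code): a single configured style drives how every sink, CSV included, renders binary values, instead of each writer hard-coding its own representation.

```python
import base64

def format_binary(data: bytes, style: str) -> str:
    """Render binary data according to a configured output style."""
    if style == "HEX":
        return data.hex().upper()
    if style == "BASE64":
        return base64.b64encode(data).decode("ascii")
    if style == "UTF8":
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"unknown binary output style: {style}")

# every writer consults the same switch instead of a hard-coded hex helper
csv_cell = format_binary(b"Spark", "HEX")
```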

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46956 from yaooqinn/SPARK-48602.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR follows up #46938 and improves `unescapePathName`.

### Why are the changes needed?
Improve `unescapePathName` by cutting off the slow path early.
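The fast-path idea can be sketched as follows (a hypothetical Python rendering, not the Scala implementation): a path with no `%` needs no decoding and is returned untouched, so only genuinely escaped paths pay for the character-by-character slow path.

```python
HEX = set("0123456789abcdefABCDEF")

def unescape_path_name(path: str) -> str:
    # fast path: no '%' means nothing to decode; return the input untouched
    first = path.find("%")
    if first < 0:
        return path
    # slow path: decode %xx escapes, keeping malformed sequences verbatim
    out = [path[:first]]
    i, n = first, len(path)
    while i < n:
        if path[i] == "%" and i + 3 <= n and path[i + 1] in HEX and path[i + 2] in HEX:
            out.append(chr(int(path[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(path[i])
            i += 1
    return "".join(out)
```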

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46957 from beliefer/SPARK-48584_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: beliefer <beliefer@163.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `scala-xml` from `2.2.0` to `2.3.0`.

### Why are the changes needed?
The full release notes: https://github.com/scala/scala-xml/releases/tag/v2.3.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46964 from panbingkun/SPARK-48609.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…error class

### What changes were proposed in this pull request?
Track state row validation failures using explicit error class

### Why are the changes needed?
We want to track these exceptions explicitly since they could be indicative of underlying corruptions/data loss scenarios.
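The motivation can be sketched with a toy example (class and error-class names here are hypothetical, not Spark's actual ones): raising a dedicated exception that carries a stable error-class identifier lets monitoring aggregate these failures, instead of matching on free-form message text.

```python
class StateRowValidationError(RuntimeError):
    """Carries a stable, machine-readable error class alongside the message."""

    def __init__(self, error_class: str, message: str):
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class

def validate_value_row(row: bytes, expected_len: int) -> None:
    if len(row) != expected_len:
        # an explicit error class makes the failure trackable downstream
        raise StateRowValidationError(
            "STATE_ROW_FORMAT_VALIDATION_FAILURE",
            f"expected {expected_len} bytes, got {len(row)}",
        )

try:
    validate_value_row(b"abc", 8)
except StateRowValidationError as e:
    caught = e.error_class
```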

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

```
13:06:32.803 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /Users/anish.shrigondekar/spark/spark/target/tmp/spark-6d90d3f3-0f37-48b8-8506-a8cdee3d25d7
[info] Run completed in 9 seconds, 861 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46885 from anishshri-db/task/SPARK-48543.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@GulajavaMinistudio GulajavaMinistudio merged commit ad19577 into GulajavaMinistudio:master Jun 13, 2024
2 of 3 checks passed
8 participants